DOCUMENT RESUME

ED 152 [digits illegible]                              TM 007 050

AUTHOR          Miller, John K.; Knapp, Thomas R.
TITLE           The Importance of Statistical Power in Educational
                Research. Occasional Paper 13.
INSTITUTION     Phi Delta Kappa, Bloomington, Ind.
NOTE            35p.
AVAILABLE FROM  Phi Delta Kappa, Eighth and Union, Post Office Box
                789, Bloomington, Indiana 47401 ($1.25)
EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Bayesian Statistics; *Decision Making; Educational
                Research; *Hypothesis Testing; *Mathematical Models;
                *Research Design; Risk; Sampling; Statistical
                Analysis; Statistics; *Tests of Significance
IDENTIFIERS     *Statistical Power; Type I Errors; *Type II Errors

The testing of research hypotheses is directly comparable to the
dichotomous decision-making of medical diagnosis or jury trials:
ill/not ill or innocent/guilty decisions. There are costs in both
kinds of error: type I errors of falsely rejecting a null hypothesis
and type II errors of falsely rejecting an alternative hypothesis.
It is important to consider the power of a statistical test (that
is, the likelihood of avoiding type II errors) as well as the
significance level (the tolerated risk of type I errors) before an
experiment is begun. Techniques for estimating the power of a
statistical test for a particular experimental design are
presented. (CTH)



Reproductions supplied by EDRS are the best that can be made
from the original document.



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION



OCCASIONAL PAPER 13

THE IMPORTANCE OF STATISTICAL POWER
IN EDUCATIONAL RESEARCH

by

John K. Miller
Thomas R. Knapp
University of Rochester




PREFACE

The problem of non-significant results in educational
experimentation has perplexed researchers for several decades. Two
mistakes are made in connection with such findings. The first is
the mistake of concluding that in fact there are no differences
between the treatments investigated. A non-significant t or F
statistic does not mean that there are no differences. Rather, such
a statistic states that in this situation we are unable to reject the
null hypothesis. Failure to reject is not the same as acceptance.
The second mistake is one of advanced planning. There have been
numerous educational experiments conducted in which the
probability of observing a real difference was practically nil from
the start, a situation that results from the unfortunate choice of
sample size, alpha level and desired mean differences. Knapp and
Miller have focused on this problem in their treatment of the
concept, the power of the test. Educational researchers can
improve their efforts if they will use the information presented
here by those two as educational experiments are being planned.

William J. Gephart
Director of Research Services
Phi Delta Kappa






I. INTRODUCTION

A. Human Decision-making

One of the most certain but imperfectly appreciated facts of
life is the uncertainty of human knowledge. The conclusions we
draw about the world around us are constructed from the data of
experience, assembled bit by isolated bit into explanations of
how and why things are as they are. We simply do not possess the
perceptual equipment for verifying the truth or falsity of any
generalization except by the cumulative weight of highly specific
observations.

As a consequence the evidence we gather in the effort to
generate and verify solutions to problems is rarely, if ever,
deterministic. At best the assurance afforded by such evidence is
probabilistic. This means, in the absence of certitude, that
preference is accorded to solutions most consistent with available
evidence and least subject to inexplicable phenomena. Without
taking into account every single piece of data relevant to the
adequate solution to a problem, one must always anticipate the
possibility of finding exceptions, of discovering more
comprehensive answers, or even of confirming an altogether
different explanation.

By the same token it is often difficult to rule out alternative
explanations completely. The amount or quality of evidence
available to illuminate the inquiring mind is often inconclusive,
providing insufficient basis either for affirming or denying some
explanation categorically. And yet we can rarely allow ourselves
the luxury of equivocation. We must act and make decisions; and,
lacking certainty, if we are to act rationally we play the odds.

When a physician, for instance, decides upon a diagnosis in a
difficult case, he weighs the evidence favoring each possible
diagnosis: that a patient does or does not suffer from a particular
disease, or that the patient is afflicted by one illness rather than
another. The danger of error in diagnosis, prescription, and
prognosis is always present, no matter what conclusion the
physician ultimately selects. His responsibility is to select the
alternative that simultaneously minimizes the prospect of error
and maximizes the likelihood of healing or prolonging life. The
doctor finds himself, in fact, subject to two quite different but
complementary types of error. When there is suspicion of cancer,
for example, the physician can arrive at one of two diagnoses:
that the patient suffers from the disease or that he does not. The
erroneous diagnosis of malignancy could lead to unnecessary
endangerment of health and life through medication, radiation
therapy, or surgery. On the other hand, the erroneous conclusion
that the symptoms do not point to cancer could result in
surgically preventable metastasis and premature death for the
patient.

Clearly the physician wishes to avoid making either type of
error. The trouble is that avoiding one of them is inversely related
to avoidance of the other. In other words, the more tests he runs,
or the more evidence he demands, or the longer he waits before
making a definitive diagnosis of illness and initiating treatment,
the greater the danger that the ravages of disease will progress
meanwhile to an irreversible stage. The physician will have clung
too tenaciously to the a priori hypothesis that his patient is not
seriously ill. Yet if the doctor relies on purely superficial or
ambiguous symptoms, he will readily evade tragic delays in
diagnosis, but he may greatly increase the frequency with which
he mistakenly treats patients for illnesses they do not have.

The physician's dilemma merely exemplifies, in a context
made dramatic by the importance of the decision he must make,
the essential peril of the dichotomous decision. Faced with
deciding between two mutually exclusive alternatives, supported
only by inconclusive evidence, the selection of either alternative
could constitute a mistake. The only reasonable solution to this
dilemma is to effect an acceptable compromise between the
danger of erring in either direction. This can actually be
accomplished only by taking account of a number of factors:

1) Explicit identification of both sources of error;

2) Assessment of the severity or cost of each type of error;

3) Comparison of the cost of making one error against the cost
   of making the other to determine which, if either, is more
   tolerable;

4) Specification, as precisely as possible, of how much danger
   of making either type of error can be tolerated when a
   decision is finally made;

5) Control over the inquiry process in a manner that insures
   adherence to specified error tolerances.

B. Scientific Decision-making

In the natural and behavioral sciences, the testing of research
hypotheses is directly comparable to the dichotomous
decision-making problem of medical diagnosis. When the scientist
systematically pursues the goal of verifying expectations
regarding, for instance, the superiority of one instructional method
over another, the very nature of his evidence imperils any conclusion
he might reach. Should evidence favor the superiority of one
method, there always lurks in the background the possibility that
uncontrolled, undetected, irrelevant factors are operating to
produce only the appearance of greater instructional
effectiveness. On the other hand, even should evidence fail to
support or appear to refute the anticipated outcome, the fault might
not lie with the anticipated outcome. It could be the result of
defects in the quality of the evidence gathered, or its quantity, or
even the manner of obtaining it.

The purpose of scientific inquiry, particularly the investigative
techniques we characterize as experimentation, is the acquisition
or extension of knowledge by adherence to formalized procedures
that minimize the prospect of reaching erroneous conclusions.
Kerlinger (1964) points out that the good scientific
experiment is designed to let the facts, and only the facts, speak
for themselves. Ideally this means that such control is exercised
over the sources of error in human judgment that the data will
actually support a true hypothesis and will not suggest that an
erroneous one is true.

By using highly refined and time-tested methods of inquiry,
science seeks to neutralize extraneous or irrelevant factors that
commonly contaminate more casual methods of acquiring
knowledge. In pursuit of this ideal, investigators are confronted,
more often than not, with the task of gathering, organizing,
synthesizing, evaluating, and interpreting an unwieldy array of
evidence. Even relatively simple problems would defy experimental
solution were it not for the susceptibility of the data of empirical
science to quantification. The ability to summarize his
observations in the form of measurements places a powerful analytic
tool, statistical inference, in the hands of the scientist.

II. STATISTICAL INFERENCE 

As we've noted, to produce an acceptable diagnosis the
physician must evaluate the symptoms he observes against
diagnostic alternatives. The data of science are weighed
statistically for their comparative consistency with certain
hypotheses, plausible alternative solutions to scientific questions.
To accomplish this purpose there is available a wide range of
statistical techniques classified in general as hypothesis-testing
procedures. These methods enable us to determine the likelihood that
the laws of chance alone suffice to account for the occurrence of
some event. Statistical analysis would indicate, for example,
that the emergence of a seven on ten successive throws of the
dice at the gaming table has a very low chance expectancy.
Quite logically the gambler would be disinclined to attribute the
event to chance and would favor the suspicion that the dice were
loaded. Similarly we might discover statistically that the
difference in learning rates associated with different instructional
methods is too great to readily attribute to fortuitous,
non-instructional aspects of pupils' learning experiences. If the
findings had been correctly anticipated, then the investigator
could claim legitimately to have "confirmed" his hypothesis
favoring the effectiveness of one method over the other.
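
The "very low chance expectancy" of the dice example can be made concrete. The short Python sketch below is an illustration added here, not part of the original paper; it assumes two fair six-sided dice and independent throws, and the variable names are chosen only for this example.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of throwing two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Chance of a seven on a single throw (6 of the 36 outcomes sum to 7).
p_seven = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))

# Chance of a seven on ten successive, independent throws.
p_ten_sevens = p_seven ** 10

print("P(seven on one throw)      =", p_seven)
print("P(seven on ten in a row)   =", p_ten_sevens, "=", float(p_ten_sevens))
```

Under these assumptions the ten-throw probability is on the order of one in tens of millions, which is why the gambler reasonably prefers the "loaded dice" explanation to chance.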

These simple examples point to the essential property of the
hypothesis-testing model. Scientific hypotheses are not validated
or invalidated in and of themselves. They are confirmed or
disconfirmed only in a relative sense, i.e., relative to some explicit
alternative, some competitive hypothesis. Statistically the
investigator did not compare two teaching methods in the example just
cited; he compared two hypotheses. This distinction is a crucial
one. Failure to recognize its implications has resulted in a
widespread practice of "doing only half the job" that a scientific
experiment is capable of accomplishing.

A. Null and alternative hypotheses 

Conventionally, in the course of his investigation, the scientist
identifies two mutually exclusive hypotheses: one called the
alternative hypothesis (H1), which he is inclined to believe will
stand; and one called the null hypothesis (H0), which he is inclined
to disbelieve. The statistical comparison of the null and
alternative hypotheses results in nothing more or less than a simple
probability statement: the degree to which a phenomenon, such
as the difference observed between two learning rates, would be
expected to occur purely by chance. However, the scientist's
principal task is to choose between H0 and H1.

B. Significance and Type I Error

In relation to that task the statistical probability statement can
be interpreted as the likelihood of making an error when the null
hypothesis is rejected and the alternative accepted. For example,
if this probability is found to be .03 (3%), and the investigator
decides to place his trust in the alternative hypothesis that the
learning rates do differ as a result of different methods of
instruction, he runs a 3% risk of error. Such an error is known as
Type I Error.

In order to decide between the null and the alternative, the
researcher must have a decision rule, a policy for determining
when to side with H1. He selects some small value, ordinarily 5%
or 1%, as the risk of Type I Error that appears, under the
circumstances, to be reasonable or tolerable. This risk is
commonly called the "level of significance" in statistical jargon and
is designated by the Greek letter alpha (α). In effect, when the
probability of committing Type I Error is found to be as small as
the risk an investigator is willing to take, he rejects the null
hypothesis and accepts the alternative hypothesis. That is, he
attributes his findings not to the operation of chance but to the
systematic influence of factors associated with the alternative
hypothesis.
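
The decision rule just described reduces to a single comparison. The Python sketch below is an added illustration, not the authors' procedure; the function name `decide` and the returned strings are hypothetical, and the .03 probability is the one used in the example earlier in this section.

```python
def decide(p_value, alpha=0.05):
    """Apply a fixed-alpha decision rule to an observed probability.

    p_value: probability of the observed result arising by chance under H0.
    alpha:   tolerated risk of Type I Error (the level of significance).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    # Reject H0 only when the chance probability is as small as the risk
    # the investigator has agreed to accept in advance.
    return "reject H0" if p_value <= alpha else "retain H0"


print(decide(0.03, alpha=0.05))  # the .03 example judged at the 5% level
print(decide(0.03, alpha=0.01))  # the same evidence judged at the 1% level
```

The same evidence leads to different decisions at the two conventional levels, which is why the level has to be fixed before the data are examined.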

When and if his observations lead the researcher to reject the
null hypothesis (H0) there is, of course, cause for rejoicing. The
evidence favors the solution he has formulated to a scientific
problem. But let us suppose, as is almost universally true, that the
investigator has been satisfied with designing a study that
cautiously guards only against the commission of Type I Error.
Suppose, further, that the outcome of the data analysis does not
permit the rejection of the null hypothesis, i.e., that the
predetermined "level of significance" is not attained, so that
adherence to the alternative hypothesis would entail an
unacceptable risk of Type I Error. Does this rigorous control against
interpreting chance events as experimental effects provide any
insurance against attributing real experimental effects to chance?
Unfortunately, but emphatically, it does not. Evidence that H0 is
not supported by the data does, indeed, enhance the credibility
of H1. But failure to reject H0 does not make it credible, or give
cause for rejecting the credibility of H1. It would, for example,
be flirting with disaster for a physician to reject the diagnosis of
cancer in the face of some, but not all, of the positive clinical
signs of the disease. Similarly, failure to find the evidence
required to confirm the superiority of one instructional method
over another does not refute its superiority. Both these instances
exemplify a basic principle of scientific inquiry, indeed of
inductive reasoning in general. Failure to confirm any hypothesis
does not constitute evidence against it. There is a huge difference
between knowing that something is not true and not knowing
that it is true.

C. Type II Error and the concept of power

In limiting his susceptibility to Type I Error through
specification of a stringent level of statistical significance, the
scientist buys protection against false claims to the discovery of new
explanations. When the criterion of significance (α) is not
satisfied and the alternative hypothesis is unconfirmed, however,
the issue remains essentially unresolved. At best, the investigator
may have the option of deferring final judgment until more data
become available. At worst, considerations of practical necessity
may require him to relinquish his research hypothesis. In either
case it remains not only possible but plausible that the alternative
(H1) is true but unsupported by the outcome of the experiment.
If H1 is, in fact, true, an error of the second kind (Type II Error)
has been committed. Typically, when results do not satisfy the
criterion of significance, the experiment is regarded as a failure.
As Campbell and Stanley (1963) point out, experimental failures
of this kind are more to be expected than experimental successes.
This should be cause for neither surprise nor discouragement. The
formulation of acceptable solutions to serious problems is far
more difficult than formulating erroneous or incomplete
solutions. However, it is unfortunate that failure to confirm
hypotheses has become equated with experimental failure. An
experiment truly fails only if it fails to extend the horizons of
knowledge.

If one could be reasonably confident that a research hypothesis
remains unconfirmed because it is actually false, the goals of
science (if not the investigator's goals) would be fulfilled. In other
words, an experiment designed to promote simultaneously the
confirmation or elimination of H1 never fails. Such a study
contributes to knowledge by proposing and supporting an
acceptable solution to a problem, or by discovering that the
solution advanced is inadequate. Either way the investigation is a
productive enterprise.

There is, however, only one way that an experiment can
constitute such a two-pronged attack on ignorance. It must not
only control Type I Error; it must also control Type II Error.
First, it is necessary to face squarely the question of the risk
involved in making a Type II Error: how serious a mistake is it to
ascribe the results of an investigation to the chance hypothesis
(H0), when the investigator's explanation (H1) is actually an
accurate one? What odds does the physician require against
diagnosing a malignant tumor (H1) as some lesser illness (H0)?
Are odds of 9 to 1, allowing a 10% probability of Type II Error,
acceptable? Or must he be more cautious, demanding better odds
(e.g., 999 to 1) and a lower risk of error (e.g., 0.1%)? If he is to
retain the confidence of his patients, the respect of his colleagues,
and his license to practice, his preference had better be for strong
safeguards against Type II Error. On the other hand, the
consequences of failing to verify the superiority of individualized
reading instruction may not be so dire. A 10% risk of abandoning
a successful innovation may be acceptable, if the conventional
method already in use is known to be reasonably effective.

In the application of statistical hypothesis-testing procedures it
is possible to place restraints upon the commission of Type II
Error, just as it is possible to limit susceptibility to Type I Error.
The risk of making a Type II Error is designated by the Greek
letter beta (β), and, like the level of significance (α), it
represents a simple probability statement: the probability of
failing to reject an erroneous null hypothesis. Statistical power
(1 - β) is a correlative concept indicating the probability of
rejecting an erroneous null hypothesis in favor of an alternative
hypothesis. The physician must ordinarily demand great power of
his diagnostic decision in cases where failure to discover a serious
illness (Type II Error) might endanger life.
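
The meaning of α, β, and power as long-run error rates can be seen by simulating many experiments. The Python sketch below is an added illustration under stated assumptions, not a procedure from the paper: scores are normal with a known standard deviation, the test is a one-tailed comparison of two group means, and every numeric value (20 subjects per group, standard deviation 10, a true difference of 8 points, 5000 simulated experiments) is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n=20, sigma=10.0, alpha=0.05, trials=5000, seed=1):
    """Fraction of simulated experiments in which H0 (no difference) is rejected."""
    rng = random.Random(seed)
    se = sigma * (2 / n) ** 0.5                    # standard error of the mean difference
    cutoff = NormalDist().inv_cdf(1 - alpha) * se  # one-tailed critical difference
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(true_diff, sigma) for _ in range(n)]
        group_b = [rng.gauss(0.0, sigma) for _ in range(n)]
        if fmean(group_a) - fmean(group_b) > cutoff:
            rejections += 1
    return rejections / trials


type_i_rate = rejection_rate(true_diff=0.0)  # H0 true: every rejection is a Type I Error
power = rejection_rate(true_diff=8.0)        # H1 true: every rejection is a correct decision
beta = 1 - power                             # H1 true: every retention is a Type II Error

print(f"estimated alpha = {type_i_rate:.3f}")
print(f"estimated power = {power:.3f}")
print(f"estimated beta  = {beta:.3f}")
```

With these hypothetical settings the estimated Type I rate should sit near the chosen .05, while the estimated power falls well short of 1, showing that controlling α by itself leaves β uncontrolled.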

Power must also be high in pharmaceutical experiments in
order to evaluate the side effects of medicinal drugs. In Europe
the disastrous effects of prematurely marketing the tranquilizer
thalidomide are a classic example of the implications of
experimental power. Failure to reject the safety of the drug during
pregnancy and to discover its damaging effect upon the fetus
(H1) was a costly Type II Error. Had more extensive research
been undertaken, requiring uneventful consumption by a larger
number of subjects, power would have increased, failure to detect
fetal damage might have been avoided, and the drug might never
have been erroneously judged safe for distribution.

Though Type II Error in educational and psychological
research is rarely accompanied by the danger of such great and
irreversible effects, the difference is one of degree rather than
kind. The seriousness of the potential error determines how much
power is necessary. But any experiment designed without
specifying a level of power proportional to that necessity is
inherently weak. Let us suppose that the level of power to be
required in an educational experiment comparing the effect
of televised instruction and programmed instruction has been
defined. And let us suppose, for purposes unique to this
investigation, that the level is a very demanding one for a learning
experiment (Power = .99). What factors influence the attainment
of desired power? They are four: 1) the significance level or risk
of Type I Error that is deemed tolerable; 2) the difference
between the two treatment-population means on the learning
criterion; 3) the variance in the population on the learning
criterion; and 4) the number of subjects in each of the treatment
groups. The selection and attainment of some preferred level of
power requires that each of these four values be specified in
advance.
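
How each of the four factors moves power can be sketched with a small function. The Python example below is an added illustration, not the paper's own computation; it assumes a one-tailed test on the difference between two group means, normal sampling distributions, equal variances, and equal group sizes. The function name `power` and the baseline values (α = .05, a 5-point difference, standard deviation 10, 30 subjects per group) are hypothetical.

```python
from statistics import NormalDist


def power(alpha, delta, sigma, n):
    """Power of a one-tailed z test on the difference of two group means.

    alpha: significance level      delta: true difference between means
    sigma: population std. dev.    n:     subjects in each treatment group
    """
    z = NormalDist()
    se = sigma * (2 / n) ** 0.5              # standard error of the difference
    cutoff = z.inv_cdf(1 - alpha) * se       # H0 is rejected above this difference
    return 1 - z.cdf((cutoff - delta) / se)  # P(difference > cutoff) when H1 is true


base = power(alpha=0.05, delta=5, sigma=10, n=30)
print(f"baseline                : {base:.3f}")
print(f"1) stricter alpha (.01) : {power(0.01, 5, 10, 30):.3f}")   # power falls
print(f"2) larger difference (8): {power(0.05, 8, 10, 30):.3f}")   # power rises
print(f"3) larger variance (15) : {power(0.05, 5, 15, 30):.3f}")   # power falls
print(f"4) more subjects (60)   : {power(0.05, 5, 10, 60):.3f}")   # power rises
```

Changing one argument at a time while holding the others fixed shows the direction in which each factor pushes power.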

The choice of significance level is, or should be, a prudential
judgment based on assessment of the consequences of committing
a Type I Error. The more severe the impact of erroneously
rejecting the null, the more rigorous the level of significance must
be.

The smallest difference between means that would be of
interest to the investigator must be selected. This represents the
precise definition of a very specific alternative hypothesis. It
should be noted that the exercise of control over statistical power
precludes the formulation of the more conventional, non-specific
alternative hypothesis. Neither the so-called "two-tailed"
alternative (i.e., that the mean difference does not equal zero), nor a
"one-tailed" hypothesis (i.e., that the mean difference favors one
treatment over the other) is sufficient. All other things being
equal, the power of the decision to accept or reject the null
hypothesis is directly affected by variation in the difference
between means.

Power is sensitive, however, not only to differences between
means, but to variation in performance among individuals as well.
The wider the range of performance to be found on the
experimental criterion variable, the greater the danger of
committing Type II Error. It is, therefore, necessary, by rational or
empirical means, to anticipate the extent of individual differences
(the variance) characteristic of the criterion measure.

Finally, variations in sample size also affect the power of
inference. However, when desired levels of power and significance
have been pre-selected, and when a mean difference worthy of
the investigator's interest has been defined, and when a
reasonable variance estimate is available, then the number of
experimental subjects needed is no longer free to vary. In fact, the
most feasible method for bringing Type II Error under control is to
calculate and select exactly that number of subjects required to
satisfy these conditions.
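
Once the other three values are fixed, the required number of subjects can be found by direct search. The Python sketch below is an added illustration, not the authors' method; it assumes a one-tailed test on two group means with normal sampling distributions and equal variances, and the names `power` and `subjects_needed` as well as the inputs (α = .05, power = .90, a 5-point difference, standard deviation 10) are hypothetical.

```python
from statistics import NormalDist


def power(alpha, delta, sigma, n):
    """Power of a one-tailed z test comparing two group means (n per group)."""
    z = NormalDist()
    se = sigma * (2 / n) ** 0.5
    cutoff = z.inv_cdf(1 - alpha) * se
    return 1 - z.cdf((cutoff - delta) / se)


def subjects_needed(alpha, target_power, delta, sigma):
    """Smallest n per group whose power reaches the target."""
    n = 2
    # Power increases with n, so the first n that reaches the target is minimal.
    while power(alpha, delta, sigma, n) < target_power:
        n += 1
    return n


n = subjects_needed(alpha=0.05, target_power=0.90, delta=5, sigma=10)
print("subjects per group:", n)
print("power at n        :", round(power(0.05, 5, 10, n), 4))
print("power at n - 1    :", round(power(0.05, 5, 10, n - 1), 4))
```

The search makes the point of the paragraph explicit: with α, power, the mean difference, and the variance specified, n is determined rather than chosen.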



III. AN ANALOGY AND AN APPLICATION

In a jury trial the guiding (null) hypothesis is that the
defendant is innocent until judged guilty.* If at the conclusion of
a trial the defendant is not judged to be guilty, i.e. the null
hypothesis is not rejected, there are two possible explanations:
(1) the defendant is in fact innocent, i.e. the null hypothesis is
true; or (2) the defendant is in fact guilty but an error (Type II)
in judgment has been made, i.e. a false null hypothesis has been
retained, because of insufficient or inconclusive evidence.

Similarly, in experimental educational research the guiding
(null) hypothesis is that there is no difference between (among)
the experimental treatments. If at the conclusion of the
experiment one of the treatments is not judged to be better than the
other(s), i.e. the null hypothesis is not rejected, there are two
possible explanations: (1) the treatments are in fact equally
effective, i.e. the null hypothesis is true; or (2) one of the
treatments is in fact better than the other(s) but an error (Type
II) in judgment has been made, i.e. a false null hypothesis has
been retained because of insufficient evidence (sample size too
small, reliability of the dependent variable too low, etc.).

*The expression "innocent until proven guilty" is used more
often, but "proof" in any absolute sense is impossible to establish
in a jury trial or in experimental educational research.




What is the relevance of power to both these situations? For
the jury trial, society, through its representatives (the prosecuting
and defense attorneys, the judge, the jury) must decide, before
the trial begins, what risks it is willing to assume as far as both
Type I Error (declaring as guilty an innocent defendant) and
Type II Error (declaring as innocent a guilty defendant) are
concerned, and act accordingly. It must choose a decision-making
procedure (trial) with appropriate power (by considering a large
amount of evidence in a long trial, for example, if both errors are
very serious and equally serious, and if a fine discrimination
between guilt and innocence must be made). For experimental
educational research, the scholarly community, through its
representative (the researcher) must decide, before the
experiment begins, what risks it is willing to assume as far as both
Type I Error (declaring as different equally effective treatments) and
Type II Error (declaring as equal differentially effective
treatments), and also act accordingly. It must also choose a
decision-making procedure (significance test) with appropriate power
(by drawing a large sample of subjects and using precise measuring
instruments, for example, if both errors are serious, and if a
rather fine discrimination between effectiveness and superiority
must be made).

Table 1 explores, step by step, the parallels between a jury
trial and an experimental educational research investigation into
the problem of the relative effectiveness of two teaching methods,
e.g. televised instruction and programmed instruction, for a
particular unit of a course in, say, secondary school (tenth grade)
biology.




Table 1: Analogy Between Jury Trial and Experimental Educational
Research

Null hypothesis:
    Jury trial: The defendant is innocent.
    Educational research: On the average, televised instruction
    and programmed instruction are equally effective.

Alternative hypothesis:
    Jury trial: The defendant is guilty.
    Educational research: On the average, televised instruction
    is ten points more effective than programmed instruction.

Risks associated with Type I (α) and Type II (β) error:
    Jury trial: Both errors, viz. deciding that an innocent person
    is guilty (Type I Error) and deciding that a guilty person is
    innocent (Type II Error), are very serious and equally
    serious.* Therefore both α and β should be equally small.
    Educational research: The cost of making a Type I Error, viz.
    deciding that one method is better than the other when in fact
    they are equally effective, is not substantial, but the cost of
    making a Type II Error is; viz. deciding that these methods are
    equally effective when in fact televised instruction is
    actually better. Therefore β should be smaller than α, say .01
    as compared to .05, necessitating the choice of a significance
    test (one-tailed) with power = .99 (= 1 - β).

Selection of sample size:
    Jury trial: The only way to keep α and β equally small is to
    have a large N, i.e. to collect a large amount of evidence and
    conduct a long trial.
    Educational research: See Figure 1 and calculations in text.

Statistic:
    Jury trial: The difference between the evidence in support of
    the defendant's guilt and the evidence in support of his
    innocence.
    Educational research: The difference between the mean scores
    on the TOUS for two independent randomly assigned samples.

Decision rule:
    Jury trial: If all jurors agree that guilt has been
    established, reject H0; otherwise, retain H0. (Such a decision
    is often based, consciously or unconsciously, on the
    probability of the association of the defendant with the
    conditions of the crime.**)
    Educational research: If the difference between the sample
    means is greater than 4.14, reject H0; otherwise, retain H0.
    (The probability of a difference greater than 4.14 is less
    than .05 if H0 is true and is greater than .99 if H1 is true.)

*[Name illegible] and most present-day liberals argue that deciding
that an innocent person is guilty is a much more serious error
(Type I). We find it difficult to make such a value judgment.

**This point is illustrated in the following example taken from
Kingston (1965a):

    Consider ... the following hypothetical case. An international
    courier, carrying a brief case cuffed to him, in which was
    carried a considerable amount of money and some secret
    documents, was murdered in London. The brief case was ripped
    open and the contents taken. The only evidence found by the
    investigating authorities was a latent fingerprint on the brief
    case that could only have been left by the perpetrator. It is
    calculated that the probability of the ridge pattern shown by
    the latent print occurring by chance on any one person is about
    1/(3 x 10^[exponent illegible]). The latent print is filed with
    an international police record office after an otherwise
    unsuccessful investigation. Every fingerprint file coming to
    the attention of this office is compared with the latent print.
    One year after the crime, such a routine comparison turns up a
    fingerprint card which contains an impression area matching the
    latent print. This card was made from a person applying for a
    job in Washington, D.C. He is immediately arrested and charged
    with the crime. Considering that the above probability
    calculation is accurate, what is the chance that the above
    accusation is in error, where the latent print is the only
    evidence that can be used?



The principal problem is one of solving for the N (assuming
equal sample sizes for the two groups) which is such that the
point marked "X" (Figure 1) cuts off 5% (= α) of the area in the
right-hand tail of the null distribution and 1% of the area in the
left-hand tail of the alternative distribution, i.e. (assuming
normality for both sampling distributions) $z_{\text{null}} = +1.645$ and
$z_{\text{alt}} = -2.327$. The standard error for each distribution (assuming
equal variances) is given by

$$\sqrt{\frac{\sigma^2}{N} + \frac{\sigma^2}{N}} = \sqrt{\frac{2\sigma^2}{N}} = \frac{1.414\,\sigma}{\sqrt{N}}.$$

The required calculations proceed as follows:

(null distribution)

$$z_{\text{null}} = \frac{X - 0}{1.414\,\sigma/\sqrt{N}} = 1.645 \;\rightarrow\; X = \frac{2.326\,\sigma}{\sqrt{N}} \tag{1}$$

(alternative distribution)

$$z_{\text{alt}} = \frac{X - 10}{1.414\,\sigma/\sqrt{N}} = -2.327 \;\rightarrow\; X = \frac{-3.290\,\sigma}{\sqrt{N}} + 10 \tag{2}$$

Equating the two expressions for X we have:

$$\frac{-3.290\,\sigma}{\sqrt{N}} + 10 = \frac{2.326\,\sigma}{\sqrt{N}}$$

or

$$-3.290\,\sigma + 10\sqrt{N} = 2.326\,\sigma$$

or

$$10\sqrt{N} = 5.616\,\sigma$$

or

$$\sqrt{N} = .562\,\sigma$$

i.e.

$$N = (.562\,\sigma)^2 = .316\,\sigma^2$$




Suppose that the dependent variable to be measured at the conclusion of the experiment is the performance of each subject on the Test on Understanding Science (TOUS). The manual for that test (Cooley and Klopfer, 1961) contains the information that for a normative sample of 1055 tenth-graders the standard deviation is 7.66 (the test contains 60 multiple-choice items). Substituting this value into the expression for N we have

$$N = .316(7.66)^2 = .316(58.68) = 18.54$$

or approximately 19 subjects per treatment group.*
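The sample-size arithmetic can be checked with a short computation. The sketch below is an illustration only: the function name `n_per_group` and its arguments are hypothetical, and the normal deviates come from Python's standard-library `statistics.NormalDist` rather than from a printed table. Because it carries unrounded deviates through the calculation, it yields roughly 18.5 rather than exactly 18.54, and the same 19 subjects per group after rounding up.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(diff, sigma, alpha=0.05, power=0.99):
    """Per-group N for a one-tailed z test on the difference of two means.

    Assumes equal group sizes, equal variances, and normal sampling
    distributions, as in the worked example.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # about 1.645 for alpha = .05
    z_beta = z.inv_cdf(power)       # about 2.326 for power = .99
    # The standard error of the difference is sqrt(2) * sigma / sqrt(N).
    # Setting (z_alpha + z_beta) * SE equal to diff and solving for N:
    return 2 * ((z_alpha + z_beta) * sigma / diff) ** 2


n = n_per_group(diff=10, sigma=7.66)
print(round(n, 2), ceil(n))
```

Changing `diff`, `sigma`, `alpha`, or `power` shows how quickly the required N grows as the hypothesized difference shrinks relative to the standard deviation.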

The value of X (the "critical" obtained difference between the means for the two samples) is found by substituting in equation (1), or equation (2), as follows:

$$X = \frac{2.326(7.66)}{\sqrt{18.54}} = 4.14$$

or

$$X = \frac{-3.290(7.66)}{\sqrt{18.54}} + 10 = 4.14$$

Thus, in order to reject the null hypothesis at the .05 level of significance and the .99 power level when the investigator hypothesized a ten-point difference between means on a variable typified by a standard deviation of 7.66, the optimal number of subjects in each of the groups is 19. Moreover, when these conditions are satisfied, the null hypothesis would, in fact, be rejected for an observed difference of 4.14 or larger. In other words, if the difference between the two methods were truly 10 criterion points, an observed difference of 4.14 would be sufficient to prevent the Type I Error rate from exceeding 5% and to prevent the Type II Error rate from exceeding 1%.

*The word "subjects" is used here in its most general sense, i.e. observations. The experimental unit may be an individual person, a classroom, a school, or what-have-you. Furthermore, the subjects may be "run" one-at-a-time or as an interacting group. Neither of these matters affects the statistical determination of N but both are of critical importance in the interpretation of the data and the generalizability of the results.

Even if the alternative hypothesis is non-specific (e.g., μ1 - μ2 ≠ 0, μ1 - μ2 > 0, or μ1 - μ2 < 0) and/or the population standard deviation unknown, one can solve for N by using an approximation procedure due to Cohen (1969) which is also found in the recent elementary statistics text prepared by Welkowitz, Ewen, and Cohen (1971). All the researcher need do is specify a relative "effect size" (the ratio of a meaningful difference to the standard deviation of the dependent variable) which would be "too good to miss" and use the tables which Cohen has compiled to find the N that does the job. For our example, if we were interested in detecting a "large" effect (Cohen's γ = .80) we would find (Welkowitz et al., 1971, p. 199) that:

$$N = 2\left(\frac{\delta}{\gamma}\right)^2$$

(δ = 3.97 for α = .05, one-tailed, and power = .99)*

$$= 2\left(\frac{3.97}{.80}\right)^2 = 2(4.96)^2 = 2(24.60) = 49.20$$

or approximately 49 subjects per treatment group.

For a "small" effect, γ = .20 and N would be large (about 787 per group); for a "medium" effect, γ = .50 and N would be about 126 per group.
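The approximation is simple enough to compute directly. The following sketch assumes the relation N = 2(δ/γ)² used above, with δ = 3.97 taken from the text; the function and constant names are illustrative only. Its unrounded results differ slightly from the hand-rounded figures quoted here (for instance, the large-effect case gives about 49.25 rather than 49.20).

```python
DELTA = 3.97  # value quoted in the text for alpha = .05 (one-tailed), power = .99


def n_from_effect_size(gamma, delta=DELTA):
    """Approximate per-group N from a relative effect size gamma."""
    return 2 * (delta / gamma) ** 2


for label, gamma in [("large", 0.80), ("medium", 0.50), ("small", 0.20)]:
    # Print the effect-size label, gamma, and the approximate per-group N.
    print(label, gamma, round(n_from_effect_size(gamma), 1))
```

Halving the effect size quadruples the required sample, which is why the "small" effect demands so many more subjects than the "large" one.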



*δ is a measure which combines the significance level and the power level.



Finally, in the unfortunate event that one is "stuck" with a "grab sample" of subjects which may be more or less than optimal in number, one can at least determine the power for the "available N" and discover if experimental conditions actually provide reasonable protection against Type II Error. Suppose that the total pool of subjects consists of 30 students (15 per treatment group) and we want to test the null hypothesis against the same specific alternative hypothesis of a ten point difference in favor of televised instruction, with σ = 7.66 and α = .05. From equation (1) we have

$$X = \frac{2.326(7.66)}{\sqrt{15}} = 4.60$$

Substituting this in the initial formulation for equation (2):

$$\frac{4.60 - 10}{1.414(7.66)/\sqrt{15}} = z_{\text{alt}} = -1.931$$

A z of -1.931 cuts off .03 of the left hand tail of the normal distribution. Therefore β = .03 and power = .97.
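The same steps can be sketched in code for any available N. The function name `power_for_n` is hypothetical, and the standard-library `NormalDist` stands in for the normal table; the unrounded figures it produces agree with the hand calculation above to two decimal places.

```python
from math import sqrt
from statistics import NormalDist


def power_for_n(n, diff, sigma, alpha=0.05):
    """Return (critical difference, z under the alternative, power).

    One-tailed z test on the difference of two means, n subjects per group,
    equal variances assumed.
    """
    z = NormalDist()
    se = sqrt(2) * sigma / sqrt(n)         # standard error of the difference
    critical = z.inv_cdf(1 - alpha) * se   # equation (1): the critical value X
    z_alt = (critical - diff) / se         # equation (2): X on the alternative scale
    beta = z.cdf(z_alt)                    # area in the left-hand tail
    return critical, z_alt, 1 - beta


critical, z_alt, power = power_for_n(n=15, diff=10, sigma=7.66)
print(round(critical, 2), round(z_alt, 2), round(power, 2))
```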



IV. WHERE HAS ALL THE POWER GONE?

The two most frequently referenced statistics books in the contemporary experimental research literature of education and psychology are those by Winer (1962) and Hays (1963). Both texts contain excellent discussions of very simple procedures for determining the appropriate sample size for a desired level of statistical power (1-β), given a specific alternative hypothesis (H1) of interest and significance level (α) chosen to test the null hypothesis (H0). As Chandler (1957) so aptly pointed out, power is "...the basic concept responsible for one's employing statistical tests as a basis for taking action on an H(ypothesis)." If power were of no consequence we could adopt the arbitrary convention of rejecting a randomly selected 5% of all null hypotheses. However, aside from an occasional rationalization of failure to confirm some pet alternative to the null, consideration of power is conspicuous by its absence from research discussions. Investigators continue to carry out mean-difference tests on as many handy subjects as they can get their hands on and feasibly accommodate.

Why has this apparently crucial aspect of the hypothesis-testing approach to statistical inference been so consistently ignored?

1. The cynic would claim that rejecting H0, whether it be true or false, has become a matter of personal survival. Power must succumb to more important considerations such as finding significant differences, breaking into print, and obtaining salary increases and promotions. We reject this accusation, perhaps naively, since we take a more optimistic view of the dedication of behavioral scientists to the advancement of knowledge.

2. The defeatist would argue that the combination of high power and a stringent significance level requires a prohibitively large N. For any H1 that differs only slightly from the null, this is true. Yet the literature contains examples of research studies where investigators have used more subjects than would be necessary to strike an optimal balance among power, statistical significance, and meaningful differences. Such studies ignore the implications that power holds for sample size, permit variations in sample size to govern the statistical decision, and, as a consequence, sacrifice relevance on the altar of significance.

3. The scholar would say that most investigators fail to appreciate the hypothesis-testing model in general and the matter of power in particular. Witness the virtual boycott of power concepts by texts devoted to the principles of research design and the conduct of scientific inquiry,* despite the clarity of presentations by Winer and Hays, and despite the periodic emergence of concern about the susceptibility of decision-making behavior to Type II Error (e.g., Cohen, 1962; Kennedy, 1970). Failure to comprehend the simple notions associated with power and sample-size determination is not at all consistent with widespread evidence of increasing sophistication in the complexities of analysis of covariance, multiple discriminant analysis, etc. Many instructors, it is true, hesitate to talk about power in introductory statistics courses,** fearing that students might find it too difficult. Instructors in more advanced courses*** may assume, on the other hand, that the notion of power is already part of their students' statistical repertoires.

*An analysis of ten of the most popular books on research methods reveals that Best, Borg, Kerlinger, Mouly, Sax, Travers, and Van Dalen devote no space; Wiersma devotes less than a page; Helmstadter allots three pages; and Fox gives fifteen pages to the consideration of power.

**The number of pages devoted to power in ten of the most popular introductory statistics texts is as follows: Blommers and Lindquist, 14; Downie and Heath, 0; Edwards - Statistical Analysis, 2; Ferguson, 0; Games and Klare, 0; Garrett, 0; Guilford - Fundamental Statistics in Psychology and Education, 14; Popham, 1; Tate, 1; and Walker and Lev - Elementary Statistical Methods, 8.

***In surveying ten texts which can best be described as either advanced or intermediate in difficulty, power received page allotments as follows: Cooley and Lohnes, 0; Edwards - Experimental Design in Psychological Research, 5; Glass and Stanley, 6; Hays, 11; Lindquist - Design and Analysis of Experiments in Psychology and Education, 0; Marascuilo, 23; McNemar, 2; Walker and Lev - Statistical Inference, 7; Wert, Neidt, and Ahmann, 0; and Winer, 4.

4. The rigid empiricist would adopt the view that power is itself an object of inquiry rather than a legitimate tool of efficient experimental design. He would label "unscientific" the a priori specification of a difference that satisfies some criterion of practical significance. He might even be perturbed by the use of an estimate of the common population variance drawn second-hand from existing information sources. A pilot study at the very least, perhaps even an extensive preliminary sampling survey, are, under most circumstances, an investigator's best recourse in the search for accurate parameter estimates. Yet experience suggests that the choice of a logical, realistic alternative hypothesis and a reasonable estimate of the population variance can often be based on other considerations. This matter is pursued further in our concluding remarks.

5. The Bayesian would suggest that the people who use the hypothesis-testing model have no real faith in its applicability to behavioral research. Consequently, they do a sloppy job of implementing it. This appears, in some respects, the most convincing and insightful conjecture. The hesitance with which implications and conclusions are extracted from observed results betrays a woeful lack of trust either in the research results themselves or in the strength of the hypothesis-testing model as a framework for inquiry. The reluctance of the practicing educator or psychologist to take seriously the implications of experimental findings further mirrors the researcher's own skepticism. The simple fact of the matter is that the hypothesis-testing model is a dichotomous decision-making model. Its user goes "all the way" with it, or he cripples it. Essential components of the model include: a) two explicitly defined hypotheses (H0 and H1); b) explicit knowledge or reasonable expectations regarding sampling distributions; and c) commitment to accept and act upon the conclusion mandated by the statistical decision. Perhaps there are very few important problems in education and psychology which lend themselves to rigorous focus on the dichotomy: "reject H0 or reject H1." We doubt it. But if such is the case, then pretense should be abandoned in favor of other models.

V. CONCLUSION

What is the prescription for salvaging the hypothesis-testing model in those situations for which it is the appropriate procedure? At the very least, it would appear, the power calculation must become as integral a concern in the experimental process as the significance test itself. The complementarity of Type I and II errors should be sufficient cause for at least acknowledging the problem of power far more frequently. It is, after all, the scientist's responsibility to minimize error - any error. The prevalent preoccupation with the avoidance of Type I Error bespeaks a commendable concern for the integrity of knowledge by subjecting H1 to rigorous tests against H0. Yet the cause of science may be prejudiced far more gravely in the long run by the erroneous, and perhaps permanent, abandonment of a true H1. By its very nature the incorrect rejection of H0 invites ultimate exposure. Type II Error, however, is more likely to escape detection. Nonetheless, we would not advocate improving power at the expense of stringent significance levels. The ideal experiment is both powerful and rigorous.

It is possible to achieve any desired power level in conjunction with a specified significance level by controlling the number of experimental observations to be taken. For mean-difference research, the control of power through sample size determination depends upon the specification of a difference in magnitude that is meaningful, together with an estimate of the common population variance for the dependent variable.

There appears to be a curious reluctance in the behavioral sciences to take responsibility for specifying how large a difference will be regarded as a meaningful difference. Perhaps the attitude persists because concern for meaningful differences hints at a utilitarian view of science, or because an a priori specification of valued outcomes appears wanting in scientific objectivity. Whatever the reason, however, the test of statistical significance has emerged as the major arbiter of the value attached to phenomena observed in behavioral research. Yet within both basic and applied fields the interest value of experimental outcomes surely varies (to some degree at least) with the absolute magnitude of the differences observed. Rare trivia are no less trivial than the garden variety.



The abundance of anecdotes about the enhancement of inconsequential outcomes by the artifice of increasing sample size attests to the need for non-statistical criteria for evaluating the importance of research results. The susceptibility of statistical inference to the vagaries of sample size is, in itself, grounds for dismissing statistical significance as the sole criterion of substantive importance. Presumably the competent researcher does have at least an appreciation of his field or discipline sufficient to differentiate trivial from non-trivial differences. If he cannot evaluate a hypothetical difference in the light of his theoretical constructs or in relation to prevailing professional practice, there is faint hope for his ability to interpret an observed difference. The real utility of the significance test is the assurance it can provide that evidence supporting some interesting alternative to H0 is, at the same time, a non-chance event.
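How strongly the significance test depends on sample size can be seen numerically. The sketch below takes a hypothetical observed difference of one point on a measure with σ = 7.66 and computes the one-tailed probability under the null hypothesis for several per-group sample sizes. The one-point difference, the sample sizes, and the function name are all assumptions chosen for illustration; none comes from the studies discussed here.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_p(observed_diff, sigma, n):
    """One-tailed p value for an observed mean difference, n per group."""
    se = sqrt(2) * sigma / sqrt(n)  # standard error of the difference
    return 1 - NormalDist().cdf(observed_diff / se)


# The same one-point difference, judged at increasing sample sizes.
p_values = {n: one_tailed_p(1.0, 7.66, n) for n in (19, 200, 2000)}
for n, p in p_values.items():
    print(n, round(p, 4))
```

With 19 subjects per group the one-point difference is nowhere near significance at the .05 level; with 2000 per group the identical difference is significant well beyond it, although its practical importance has not changed.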

Determining in advance the order of magnitude that would distinguish an uninspiring difference from an exciting one offers two important advantages. Besides facilitating advance control over Type II Error tolerance, the procedure effectively excludes from the region of rejection differences too small to be of interest. The net effect is reasonable assurance of rejecting the null when the observed difference is as large as the appropriate predetermined value and avoidance of the embarrassing responsibility for contriving plausible conclusions and implications from inconsequential, but statistically significant, results.

Estimation of the common population variance is an empirical problem - one that is more tractable than is generally believed. Many studies employ instruments about which a great deal is already known. Large numbers of published tests that have achieved popularity as research instruments provide normative data based on large and reasonably representative standardization samples. Many other research studies use unpublished measures that have been used before and for which there is available a rough approximation to the population variance. And, finally, responsible research which employs newly designed instruments will make provision for the acquisition of reliability and validity data that include the variance estimate required for the power calculations. It would be unfortunate to ignore such rich sources of information simply because they were not produced with the explicit intent of facilitating the control or evaluation of statistical power or, even worse, because they were not a product of the investigator's own efforts. Acknowledging dependency on external sources of relevant statistical input might eventually contribute to diminishing the fragmentation so characteristic of research products in the behavioral sciences. Present custom, characterized by emphasis on the uniqueness of an author's contribution, tends to weaken, rather than strengthen, the continuity of science and the relevance of new discoveries.

Given the importance of statistical power to the utility and integrity of difference testing as a decisive contributor to the advancement of knowledge, there is no room for equivocation. Estimating the common population variance, specifying a difference that can be regarded as meaningfully large, selecting a power level that enforces reasonable restraints upon Type II Error, and calculating the sample size required to satisfy the power specification should be as routine as the selection of a significance level. Moreover, like the selection of significance level, these tasks should precede the collection of a single piece of data.

But what if the investigator can honestly claim that for his study there is no intuitive, rational, or empirical basis for stipulating some difference "too good to miss"? The Utopian age of ratio scaling forecast for psychological measures by Wright (1968) has yet to dawn. As a result, the cautious investigator may be deterred by the arbitrariness of his scale from attributing substantive meaning to specific differences or to variance estimates. The solution to this problem rests with the concept of "effect size" exploited in Cohen's (1969) magnificent contribution to the statistical literature, a reference work devoted exclusively to power, complete with tables for sample size determination for virtually every commonly used test of significance. All one need do is specify the relative order of magnitude of a meaningful difference, choose the preferred significance and power levels, and read off the tabled sample size. Not only is the notion of effect size an intuitively pleasing one (e.g. an anticipated difference equal to half the pooled samples' standard deviation for the t-test employed with independent samples); it is a practical one. Cohen even provides the user with a rational basis for identifying for each statistic effect sizes that may be described as small, medium, and large in the light of results typically associated in the behavioral sciences with weak, moderate, and potent experimental treatments.

What practical implications does advance determination of sample size have for the design and conduct of research? If the investigator discovers that the sample required is too large to be accommodated by available resources, he has the opportunity to avail himself of a number of reasonable alternatives. If both the power level and the significance level chosen are really crucial, the study may be abandoned or deferred until a sufficient number of subjects can be mustered. It may be possible, on the other hand, to effect an acceptable compromise between power and significance by tolerating an increased probability of Type I Error in order to bring Type II Error under sensible control. Or the investigator may take a calculated risk and proceed with the proposed study, fully aware of the extent to which his procedure is less than optimal.
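The compromise between power and significance can be made concrete. The sketch below assumes 15 subjects per group, σ = 7.66, and a hypothetical five-point difference (none of these is prescribed by the text, and the function name is illustrative), and shows how power rises as the tolerated Type I Error rate is relaxed with N held fixed.

```python
from math import sqrt
from statistics import NormalDist


def power_at(alpha, n, diff, sigma):
    """Power of a one-tailed z test on two means for a given alpha and n."""
    z = NormalDist()
    se = sqrt(2) * sigma / sqrt(n)   # standard error of the difference
    z_alpha = z.inv_cdf(1 - alpha)   # critical deviate for the chosen alpha
    # The null is rejected when the observed difference exceeds z_alpha * se;
    # under the alternative, that occurs with probability Phi(diff/se - z_alpha).
    return z.cdf(diff / se - z_alpha)


powers = [power_at(a, n=15, diff=5, sigma=7.66) for a in (0.01, 0.05, 0.10)]
for alpha, p in zip((0.01, 0.05, 0.10), powers):
    print(alpha, round(p, 2))
```

Under these assumed conditions, moving from α = .01 to α = .10 more than doubles the power, which is the kind of trade an investigator with a fixed pool of subjects must weigh.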

Suppose, however, that desired levels of power and significance do not drain the supply of available subjects. There is, obviously, no scientific virtue in the extravagance of fruitlessly large samples, particularly if smaller groups might permit notable improvements upon the original research design by permitting inclusion of additional experimental treatments or the investigation of interesting interactions.

Can experimental behavioral science really afford the dubious luxury of continuing to blunder upon significance, subject to the fortuitous coincidental attainment of sufficient statistical power?







ANNOTATED BIBLIOGRAPHY 

Campbell, D. T., and Stanley, J. C. "Experimental and Quasi-Experimental Designs for Research." In Gage, N. L. (ed). Handbook of Research on Teaching. Chicago: Rand McNally and Company, 1963.

This already-classic chapter on the non-mathematical aspects of experimental design is also available as a separate paperback with a slightly altered title (same publisher, 1966).

Chandler, R. "The Statistical Concepts of Confidence and Significance." Psychological Bulletin, 1957, 54, 429-30.

This very brief two-page article clarifies the distinction between significance (a term associated with the likelihood of getting a difference between a particular statistic and a parameter, in hypothesis testing) and confidence (a term associated with the likelihood that a particular interval around a statistic covers the parameter, in interval estimation). The two terms are often confused with one another in the literature pertaining to inferential statistics.

Cohen, J. "The Statistical Power of Abnormal-Social Psychological Research: A Review." Journal of Abnormal and Social Psychology, 1962, 65, 145-153.

This article was the first of many efforts by Cohen to point out the neglect of statistical power in behavioral research. The first part of the article is a summary of the basic concepts involved in power analysis, with examples. The second part is devoted to a critique of 78 articles which appeared in volume 61 (1960) of the Journal of Abnormal and Social Psychology, focusing on the question: "What kind of chance did these investigators have of rejecting false null hypotheses?"

Cohen, J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1969.

A textbook devoted entirely to the concept of power, complete with formulas and tables for determining either the sample size required for a given power or the power associated with a given sample size. All of the commonly-encountered significance tests - single sample mean, difference between two sample means (independent or correlated), sample correlation coefficient, etc. - are treated.

Cooley, W. W., and Klopfer, L. E. Manual for Administering, Scoring, and Interpreting Scores - Test on Understanding Science, Form W. Princeton, New Jersey: Educational Testing Service, 1961.

The data provided in this manual (means, standard deviations, etc.) are illustrative of the kinds of information a researcher might need if he were carrying out an experimental educational investigation for which performance on this test were the principal dependent variable.

Hays, W. L. Statistics for Psychologists. New York: Holt, Rinehart, and Winston, 1963.

The very popular textbook used in many educational and psychological statistics courses throughout the country. It has an excellent section on power (pp. 269-280).

Kennedy, J. J. "A Significant Difference Can Still be Significant." Educational Researcher, October 1970, 2, 7-9.

One of many reactions to an article by Coats, in a previous issue of the same journal, objecting to the study of inferential statistics. Kennedy suggests, among other things, the tailoring of sample size to the conditions of the experiment by employing statistical power analysis.

Kerlinger, F. N. Foundations of Behavioral Research. New York: Holt, Rinehart, and Winston, 1964.

This popular text in research methods does not treat power as such but does bring to the researcher's attention some of the basic issues involved in statistical inference.

Kingston, Charles R. "Applications of Probability Theory in Criminalistics." Journal of the American Statistical Association, 1965, 60, 70-80. (a)

Kingston, Charles R. "Applications of Probability Theory in Criminalistics - II." Journal of the American Statistical Association, 1965, 60, 1028-1034. (b)

A pair of articles concerned with models for evaluating physical evidence for criminal trials. Power is not explicitly considered, but the probabilistic basis on which it rests is treated in detail.

Scheff, T. J. "Decision Rules, Types of Error, and Their Consequences in Medical Diagnosis." Behavioral Science, 1963, 8, 97-107.

As the title indicates, this article explores the relative consequences of Type I Error ("judging a well person sick") and Type II Error ("judging a sick person well") in the field of medicine. The author questions the usually-unwritten assumption that Type II Errors are more serious in this context, whereas he supports the notion in the field of law that it is worse to convict an innocent person than to let a guilty one go free.

Welkowitz, J., Ewen, R. B., and Cohen, J. Introductory Statistics for the Behavioral Sciences. New York: Academic Press, 1971.

This very new textbook is one of the few introductory texts which contains more than a page or two on power. In Chapter 13 the authors treat both sample size and power determination for the four most commonly encountered significance tests, viz. single sample mean, difference between two independent sample means, single sample proportion, and sample correlation coefficient.

Winer, B. J. Statistical Principles in Experimental Design. New York: McGraw-Hill, 1962.

This equally popular (along with Hays) textbook also contains a clear presentation of the basic notions of statistical power, with special relevance to single-classification ("one-way") analysis of variance.

Wright, B. D. "Sample Free Test Calibration and Person Measurement." Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, New Jersey: Educational Testing Service, 1968.

A most convincing plea for and description of a procedure devised by Rasch for freeing psychological measurement from the particular instrument employed and from previous measures obtained.



[Figure 1. Null (H0: μ1 - μ2 = 0) and Alternative (H1: μ1 - μ2 = 10) Sampling Distributions for Power = .99 (β = .01); Significance level (α) = .05; Common population standard deviation = 7.66; Sample size = 19. The horizontal axis is marked at 0 and 10; the tail areas shown are β = .01 and α = .05.]



OCCASIONAL PAPERS

1. THE PROBLEM AND PROBLEM DELINEATION TECHNIQUES - William J. Gephart, Phi Delta Kappa. Presented at the Second National Symposium for Professors of Educational Research, sponsored by Phi Delta Kappa, Boulder, Colorado, November 21, 1968. A discussion of the nature of the concept "problem" as related to educational research with a discussion of several techniques useful in problem identification and delineation. $1.00

2. A REVIEW OF INSTRUMENTS DEVELOPED TO BE USED IN THE EVALUATION OF THE ADEQUACY OF REPORTED RESEARCH - Bruce B. Bartos, Phi Delta Kappa, Indiana University. Presented at the Annual Meeting of the American Educational Research Association, February 1969, Los Angeles, California. A brief description and bibliographic annotation of 40 instruments developed to be used in assessing the methodological quality of completed research. $.25

3. PROFILING EDUCATIONAL RESEARCH - William J. Gephart, Phi Delta Kappa, January 1969. The rationale for the development of a methodology profile on completed research to show its strengths and weaknesses. Included are flow charts for profiling the five facets of the research process. $.75

4. APPLICATION OF THE CONVERGENCE TECHNIQUE TO READING - William J. Gephart, Phi Delta Kappa, January 1969. An interim report on a research program planning effort in the field of reading. Free

5. THE CONVERGENCE TECHNIQUE AND READING: A PROGRESS REPORT - William J. Gephart, Phi Delta Kappa. Presented at the Annual Meeting of the International Reading Association, May 2, 1969, Kansas City, Missouri. A second interim report on the planning of a reading research program. Free

6. THE EIGHT GENERAL RESEARCH METHODOLOGIES: A FACET ANALYSIS OF THE RESEARCH PROCESS - William J. Gephart, Phi Delta Kappa, July 14, 1969. The identification and description of general research methods in education through the use of Guttman's facet design and analysis technique. It also details the procedures for the Guttman technique. This paper was printed in the proceedings of the Warsaw, Poland Congress of the International Association for the Advancement of Educational Research. Free

7. PROFILING INSTRUCTIONAL PACKAGE - William J. Gephart & Bruce B. Bartos, Phi Delta Kappa, August, 1969. An instructional text to assist individuals with no prior research training in the use of research profiling flow charts to assess the methodological adequacy of completed research. $1.00

8. EDUCATIONAL KNOWLEDGE USE - Gene V Glass, Laboratory of Educational Research, University of Colorado. An analysis of the availability and use of empirically based information in education. $.50

9. MEASUREMENT AND RESEARCH IN THE SERVICE OF EDUCATION - Warren G. Findley, Research and Development Center in Educational Stimulation, University of Georgia. Originally presented as an invited address at the annual meeting of the American Educational Research Association, this paper uses an historical perspective to examine the role of measurement and research in education. $.75



10. THE EDUCATIONAL CATALYST: AN IMPERATIVE FOR TODAY - Joe H. Ward, Jr., Reeve Love, George M. Higginson, Southwest Educational Development Laboratory, Austin, Texas, July, 1971. An analysis of the problems involved in the process of change and improvement of the practice of education. This paper poses a new professional speciality for the facilitation of empirically based educational improvements. $1.00

11. DISSERTATIONS YOU MAY WANT TO SEE - William J. Gephart, Phi Delta Kappa, 1970. A collection of dissertations done in 1969 which focus on research training. $.25

12. THE DOCTORATE IN EDUCATION IN CANADA - Neville L. Robertson, Commission on Higher Education, Phi Delta Kappa, 1971. An analysis of the institutions offering the doctorate in education in Canada. This paper is a companion piece to a larger study of similar institutions in the United States. $.75

13. THE IMPORTANCE OF STATISTICAL POWER IN EDUCATIONAL RESEARCH - John K. Miller, Thomas R. Knapp, University of Rochester. When an educational experiment results in non-significant differences can it be said that no difference exists? This paper discusses the concept that must be attended to if that question is to be answered. It also details the procedure for determination of sample size needed in an experiment. $1.25



